Gradient Descent

No Mercy for High Loss - Only Descent

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.

Gradient Descent (GD) is an optimization algorithm used to minimize the cost or loss function in machine learning and deep learning. It iteratively adjusts parameters (weights) to reduce the error between predicted and actual outcomes. The idea is to move towards the minimum value of the loss function by following the negative gradient. The cost function represents how well a model fits the data. Gradient descent seeks to minimize this cost.

Gradient descent is a way to minimize an objective function \(J(\theta)\), parameterized by a model’s parameters \((\theta \in \mathbb{R}^d )\), by updating the parameters in the opposite direction of the gradient of the objective function, \(\nabla J(\theta)\), with respect to the parameters \(\theta\).

The learning rate, \(\eta\), determines the size of the steps we take to reach a (local) minimum.

Why Gradient Descent?
It works well for optimizing high-dimensional problems, such as neural networks.

#1 Steps Involved in Gradient Descent

1. Initialize the Parameters

  • Begin by initializing the model parameters \(\theta\) with some values (usually small random numbers).
  • These parameters will be updated iteratively to minimize the objective function.

2. Calculate the Gradient

  • Compute the gradient of the objective function \(J(\theta)\) with respect to the parameters \(\theta\). This is the derivative \(\nabla J(\theta)\), which indicates the direction of the steepest ascent.

3. Update the Parameters

  • Update the parameters \(\theta\) in the direction opposite to the gradient to minimize the objective function: \[ \boldsymbol{\theta_{new} = \theta_{old} - \eta \cdot \nabla J(\theta_{old})} \] where \(\eta\) is the learning rate, controlling the step size.

4. Repeat Until Convergence

  • Continue iterating through steps 2 and 3 until the algorithm converges, i.e., the changes in the parameters become very small or the objective function reaches a desired minimum, e.g., \[|\theta_{old} - \theta_{new}| \approx 0.002\]

5. (Optional) Check for Convergence Criteria

  • Convergence can be determined by one or more of the following:
    • The gradient \(\nabla J(\theta)\) becomes sufficiently close to zero.
    • The change in the value of \(J(\theta)\) between iterations falls below a predefined threshold.
    • A fixed number of iterations is reached.

6. Return the Final Parameters

  • Once the parameters have converged, return the final optimized values of \(\theta\).
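The six steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming a simple one-dimensional objective \(J(\theta) = (\theta - 3)^2\), whose gradient \(2(\theta - 3)\) is known in closed form:

```python
def gradient_descent(grad, theta0, eta=0.1, tol=1e-6, max_iters=1000):
    """Minimize an objective given a function computing its gradient."""
    theta = theta0                        # Step 1: initialize parameters
    for _ in range(max_iters):            # Step 4: repeat until convergence
        g = grad(theta)                   # Step 2: calculate the gradient
        theta_new = theta - eta * g       # Step 3: update opposite the gradient
        if abs(theta_new - theta) < tol:  # Step 5: convergence criterion
            return theta_new
        theta = theta_new
    return theta                          # Step 6: return final parameters

# J(theta) = (theta - 3)^2, so dJ/dtheta = 2*(theta - 3); minimum at theta = 3
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Because this objective is convex, each step shrinks the distance to the minimum at \(\theta = 3\) by a constant factor \((1 - 2\eta)\).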

Gradient Descent Algorithm

Gradient Descent Pseudo-code

  1. Initialize parameters \(\theta\) (e.g., \(\theta_0, \theta_1, ..., \theta_n\)) randomly or with zeros
  2. Choose learning rate \(\eta\) (a small positive number)
  3. Set the number of iterations (max_iters) or define a convergence criterion

Repeat until convergence or for max_iters times:

  1. Calculate the gradient of the cost function \(J(\theta)\) with respect to each parameter \(\theta\)
    • The gradient is the vector of partial derivatives, one per parameter: \[\nabla J(\theta) = \left[\frac{\partial}{\partial \theta_1}J(\theta), \frac{\partial}{\partial \theta_2}J(\theta), \dots, \frac{\partial}{\partial \theta_n}J(\theta)\right]\] (The \(i\)-th component, \(\text{gradient}_i = \frac{\partial}{\partial \theta_i}J(\theta)\), is the partial derivative of the cost function with respect to \(\theta_i\))
  2. Update the parameters \(\theta\) using the gradient and learning rate \(\eta\):
    • For each parameter \(\theta_i\): \[\theta_i = \theta_i - \eta \cdot \text{gradient}_i\]
  3. Optionally: Check if the change in the cost function or parameters is small enough (convergence criterion)

Return the optimized parameters \(\theta\)

Taking an example

Say the objective function is \(J(\theta)\) with \(\theta = (m, c)\) (e.g., the slope and intercept of a fitted line). Then gradient descent becomes

repeat until converge {
m[i] = m[i-1] - (learning rate * slope_m)
c[i] = c[i-1] - (learning rate * slope_c)
}

where, slope_m = \(\large \frac{\partial J}{\partial m}\) and slope_c= \(\large \frac{\partial J}{\partial c}\)

  1. If Slope is positive:
    m[i] = m[i-1] - [(learning rate) * (+ slope_m)]

  2. If Slope is negative:
    m[i] = m[i-1] - [(learning rate) * (- slope_m)]
    m[i] = m[i-1] + [(learning rate) * (slope_m)]

The model is said to have converged when the update becomes negligible, e.g., \(|m[i] - m[i-1]| \approx 0.01\) or smaller.
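The \(m\), \(c\) updates above can be sketched concretely with NumPy. The data and learning rate here are illustrative; the slope formulas follow from differentiating the MSE cost with respect to \(m\) and \(c\):

```python
import numpy as np

# Synthetic data from a known line y = 2x + 1 (illustrative values)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

m, c = 0.0, 0.0            # initial guesses
eta = 0.05                 # learning rate
n = len(X)
for _ in range(5000):      # repeat until (approximately) converged
    y_hat = m * X + c
    # slope_m = dJ/dm and slope_c = dJ/dc for J = (1/n) * sum((y - y_hat)^2)
    slope_m = (-2.0 / n) * np.sum((y - y_hat) * X)
    slope_c = (-2.0 / n) * np.sum(y - y_hat)
    m = m - eta * slope_m
    c = c - eta * slope_c
```

After convergence, \(m\) and \(c\) recover the slope and intercept used to generate the data.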

Finding Minimum from any point

#2 Learning Rate η

The learning rate is a hyperparameter in machine learning that controls the step size at which the parameters \(\theta\) of an objective function \(J(\theta)\) are updated during training.
It’s important to achieve a desirable learning rate because both low and high learning rates can waste time and resources.
- Too large: The algorithm might overshoot the minimum, causing divergence.
- Too small: The convergence will be slow, and it might get stuck in local minima.

A desirable learning rate is one that’s low enough so that the model/network converges to something useful but high enough so that it can be trained in a reasonable amount of time.
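This trade-off can be seen numerically. In this toy sketch, assuming \(J(\theta) = \theta^2\) with gradient \(2\theta\), a moderate \(\eta\) converges quickly, a tiny \(\eta\) barely moves, and an \(\eta\) outside the stable range makes the iterates grow:

```python
def run(eta, steps=20, theta=5.0):
    """Apply gradient descent steps on J(theta) = theta**2 (gradient 2*theta)."""
    for _ in range(steps):
        theta = theta - eta * 2 * theta   # each step multiplies theta by (1 - 2*eta)
    return theta

good = run(eta=0.1)    # factor 0.8 per step -> approaches 0
tiny = run(eta=0.001)  # factor 0.998 per step -> barely moves from 5.0
huge = run(eta=1.1)    # factor -1.2 per step -> magnitude grows (diverges)
```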

Learning Rate effect


#3 Some Concepts

Before going deeper into gradient descent, a few supporting concepts are worth reviewing:

1. Loss function

This represents the error for a single training example. It measures how far off a model’s prediction is from the actual value for a single data point. For example, in regression, the loss for a single example could be the squared error: \[\text{Loss} = (y_i - \hat{y_i})^2\]

2. Cost function

The cost function is typically the average of the loss over the entire training set, giving an overall measure of model error across all data points. For example, with Mean Squared Error (MSE), the cost function over \(m\) samples is: \[J(W) = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y_i})^2\]
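The loss/cost distinction can be shown directly with NumPy (the targets and predictions here are toy values chosen for illustration): the loss is computed per example, and the cost averages those losses.

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # actual values
y_hat = np.array([2.5, 5.0, 8.0])   # model predictions

per_example_loss = (y - y_hat) ** 2   # squared-error loss for each sample
cost = per_example_loss.mean()        # MSE cost over all m samples
```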

3. Convex function

A convex function is a type of function in mathematics and optimization that has a shape where any line segment drawn between two points on the function’s curve lies above or on the curve itself. Convex functions have specific properties that make them particularly important in optimization because they are easier to work with, especially when it comes to finding global minima.

For a convex function:

  • Upward Curvature: The function curves upward or is a straight line.
  • Single Global Minimum: There is either a unique global minimum (the lowest point on the curve) or the function is flat (in which case every point is a minimum).

Example:
\(f(x) = x^2\)
\(f(x) = e^x\)
\(f(x) = ax+b\) etc.

Code
import matplotlib.pyplot as plt
import numpy as np
ipt = np.linspace(-10,10,100)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(5,3))
plt.plot(ipt, ipt**2, label="f(x)")  # f(x) = x^2 is convex
plt.axhline(80, color='black')
plt.title("Convex function", color='brown')
plt.axis('off')
plt.legend()
plt.show()

4. Non-Convex function

A non-convex function has multiple peaks and valleys, resulting in multiple local minima and local maxima. This makes finding the global minimum (the lowest point) challenging because optimization algorithms like Gradient Descent can easily get stuck in one of these local minima rather than finding the lowest possible value.

The loss functions used in deep learning (e.g., mean squared error or cross-entropy) are generally non-convex due to the large number of parameters and the complex network structure, leading to highly irregular optimization landscapes.

For a non-convex function:

  • Multiple Local Minima and Maxima: non-convex functions often have several local minima and maxima.
  • Complex landscape: Non-convex functions often have irregular, oscillating shapes. This creates a “rough” landscape with many ups and downs.
  • Curved in Different Directions: Non-convex functions may have different curvatures in different directions, which can create valleys, peaks, and saddle points.

Examples:
\(f(x) = x^4 - x^3\)
\(f(x) = \text{sin}(x)\) etc.

Code
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
ipt = np.linspace(0,600,100)*np.pi/180
plt.figure(figsize=(5,3))
plt.plot(ipt, np.sin(ipt) + 0.08*ipt, label='f(x)')  # sine wave with a slight upward trend
plt.axhline(0.25, color='black')
plt.axis('off')
plt.title("Non Convex function", color='brown')
plt.legend(loc='lower left')
plt.show()

#4 Types of Gradient Descent

There are three variants of Gradient Descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of parameter update and the time it takes to perform an update.

1. Batch Gradient Descent (Vanilla)

Batch Gradient Descent is an optimization algorithm used to minimize the cost function of a model by updating its parameters in the direction that reduces the overall error.

The gradient is calculated using the entire dataset at each iteration. The term “batch” indicates that we use the full batch of data points in each update. This makes the updates more stable but can be computationally expensive if the dataset is very large.

Consider this pseudo-code:

Set learning rate η
Set number of epochs
for each epoch in range(epochs):
   # Step 1: Forward pass, i.e., predict y_hat = X * W
   y_hat = np.dot(X, W)

   # Step 2: Calculate cost function (e.g., Mean Squared Error)
   L = (1 / m) * sum((y_hat - y) ** 2)   # m: no. of samples

   # Step 3: Calculate gradient
   gradient = dL/dW

   # Step 4: Update weights
   W = W - η * gradient

   # Optionally print or store the cost for monitoring
   print("Epoch:", epoch, "Loss:", L)

Return final weights W

Explanation

1. Initialize weights: Start with random weights \(W\)
2. Learning rate and Epochs: Define the learning rate \(\eta\) and the number of epochs.
3. Loop through epochs:
   3.1 Calculate Predictions: Compute predictions \(\hat{y}\) for each data point \(x_i\) using the dot product:
   y_hat = np.dot(X, W)
   3.2 Compute Cost Function: Use the appropriate cost function \(J(W)\). This measures the model’s error with the current weights over the entire dataset.
   3.3 Calculate Gradient \(\boldsymbol{\nabla_WJ(W)}\): The gradient is the derivative of the cost function w.r.t. each weight, \(\large \frac{\partial J(W)}{\partial w}\), which we use to update the weights \(W\).

4. Update Weights: Update the weights \(W\) by moving in the opposite direction of the gradient:\[\boldsymbol{W = W - \eta \cdot \nabla_{W} J(W)}\]

Repeat steps 3.1–3.3 and the update in step 4 for a specified number of epochs or until the cost function converges (i.e., does not decrease significantly with further updates).

In batch gradient descent, the number of weight updates equals the number of epochs: one update per full pass over the data.
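The pseudo-code above can be turned into a short runnable sketch with NumPy. The data here is illustrative and noise-free, and the model is a no-intercept linear model \(y = XW\); note that each epoch performs exactly one weight update over the full batch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # 100 samples, 2 features
true_W = np.array([1.5, -2.0])
y = X @ true_W                     # targets generated from known weights

W = np.zeros(2)                    # initialize weights
eta, epochs, m = 0.1, 500, len(X)
for epoch in range(epochs):
    y_hat = X @ W                                  # Step 1: forward pass
    cost = (1 / m) * np.sum((y_hat - y) ** 2)      # Step 2: MSE cost
    gradient = (2 / m) * X.T @ (y_hat - y)         # Step 3: dJ/dW over the full batch
    W = W - eta * gradient                         # Step 4: update weights
```

Because the data is noiseless and the cost is convex, the weights converge to the values used to generate the targets.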

Gradient Descent Convergence

Contour plot of Batch GD
Advantages
  1. Steady and Accurate Convergence to a Minimum: Batch Gradient Descent will normally yield a direct path to a minimum. This means little time will be spent moving in incorrect directions in weight space if the learning rate \(\eta\) is not too big. This also means Batch Gradient Descent will yield a very accurate estimate of the minimum as shown above.

  2. Easy Implementation: The Batch Gradient Descent algorithm is quite simple to implement in code, when compared to other variations. The number of hyperparameters introduced is kept to a minimum, and the implementation can be fully vectorised.

  3. Excellent for Convex Functions: Batch Gradient Descent works very well for convex functions with smaller-sized datasets. In such a scenario, all derivatives point to the single global minimum, and there will be no issue loading all the training data into memory for processing.

Limitations
  1. Impractical for Larger Datasets: As all the training data is used to compute each step in our algorithm, this approach is very computationally expensive. All the training data must be loaded into memory. For Big Data, this simply isn’t feasible.
  2. Slow to Converge: Related to the first point; since all the training data is used to compute each step, this puts strain on the processing resources we have available. The algorithm may only slowly converge towards a minimum for larger datasets.
  3. Difficult to Escape Local Minima: For non-convex loss functions, many local minima will be present in weight space. Batch GD tends to follow a direct path to the nearest minimum, so it can get stuck at a local minimum in these situations.